Luckily for us, fitting a linear model to data does not require iteratively searching for the best intercept and slope by hand. As it turns out, R can do this for us much more precisely, and very fast!

Let’s explore how to do this, using a real-life dataset taken from the Ecdat package, which includes many economics-related datasets. In this example, we will use the Caschool dataset, which contains the average test scores of 420 elementary schools in California along with some additional information.

Loading and exploring the data

We can explore which variables are included in the dataset using the names() function:

library("Ecdat") # Attach the Ecdat library

names(Caschool) # Display the variables of the Caschool dataset
##  [1] "distcod"  "county"   "district" "grspan"   "enrltot"  "teachers"
##  [7] "calwpct"  "mealpct"  "computer" "testscr"  "compstu"  "expnstu" 
## [13] "str"      "avginc"   "elpct"    "readscr"  "mathscr"

For each variable in the dataset, basic summary statistics can be obtained by calling summary():

summary(Caschool[, c("testscr", "str", "avginc")])
##     testscr           str            avginc      
##  Min.   :605.5   Min.   :14.00   Min.   : 5.335  
##  1st Qu.:640.0   1st Qu.:18.58   1st Qu.:10.639  
##  Median :654.5   Median :19.72   Median :13.728  
##  Mean   :654.2   Mean   :19.64   Mean   :15.317  
##  3rd Qu.:666.7   3rd Qu.:20.87   3rd Qu.:17.629  
##  Max.   :706.8   Max.   :25.80   Max.   :55.328
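If you want to peek at the raw observations themselves, head() displays the first few rows of the dataset:

head(Caschool[, c("testscr", "str", "avginc")]) # first six rows of the selected columns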

Fitting a linear model to the data

Suppose a policymaker is interested in the following linear model:

\[(testscr)_i = \beta_0 + \beta_1 \times (str)_i + \epsilon_i\] where \((testscr)_i\) is the average test score for a given school \(i\) and \((str)_i\) is the Student/Teacher Ratio (i.e. the average number of students per teacher) in the same school \(i\). We can think of \(\beta_0\) and \(\beta_1\) as the intercept and the slope of the regression line.

The subscript \(i\) indexes all unique elementary schools (\(i \in \{1, 2, 3, \dots 420\}\)) and \(\epsilon_i\) is the error, or residual, of the regression. (Remember that our procedure for finding the line of best fit is to minimize the sum of squared residuals (SSR)).
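To make this concrete, here is a small sketch of how you could compute the SSR for one candidate line by hand. The intercept and slope below are arbitrary guesses, not estimates:

# Arbitrary candidate intercept and slope (just guesses)
b0 <- 700
b1 <- -2

# Residuals: observed score minus the candidate line's prediction
res <- Caschool$testscr - (b0 + b1 * Caschool$str)

# Sum of squared residuals (SSR) for this candidate line
sum(res^2)

The line of best fit is the pair of values for b0 and b1 that makes this sum as small as possible, and that is exactly what R will compute for us below.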


At this point you should step back and take a second to think about what you believe the relation between a school’s test scores and student/teacher ratio will be. Do you believe that, in general, a high student/teacher ratio will be associated with higher-than-average test scores for the school? Do you think that the number of students per teacher will impact results in any way?

Let’s find out! As always, we will start by plotting the data to inspect it visually (don’t worry if the syntax doesn’t make much sense right now, we will come back to it very soon):

plot(formula = testscr ~ str,
     data = Caschool,
     xlab = "Student/Teacher Ratio",
     ylab = "Average Test Score", pch = 21, col = 'blue')

Can you spot a trend in the data? What do you think the line of best fit would look like? Would it be upward or downward sloping? Let’s ask R!

The lm() function

We will use the built-in lm() function to estimate the coefficients \(\beta_0\) and \(\beta_1\) using the data at hand. The lm() function typically takes only two arguments:

lm(formula, data)
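In an R formula, the variable to the left of the tilde (~) is the dependent variable and the variable(s) to the right are the regressors. Note that the formula can be passed either as a quoted string, as we do in this section, or unquoted; the two calls below are equivalent:

lm(formula = "testscr ~ str", data = Caschool) # quoted formula
lm(formula = testscr ~ str, data = Caschool)   # unquoted formula, more common in practice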

In the context of our example, the function call is therefore:

lm(formula = "testscr ~ str",
   data = Caschool)
## 
## Call:
## lm(formula = "testscr ~ str", data = Caschool)
## 
## Coefficients:
## (Intercept)          str  
##      698.93        -2.28

As we can see, R automatically spits out its estimates for the intercept and slope coefficients, \(\hat{\beta_0} =\) 698.93 and \(\hat{\beta_1} =\) -2.28. The relationship between a school’s Student/Teacher Ratio and its average test results is negative: on average, one additional student per teacher is associated with a test score about 2.28 points lower.

For reasons that are outside the scope of this class, running a linear regression in R is typically a two-step process. You first assign the output of the lm() call to an object and then call a second function (for our purposes, mainly summary()) on the resulting object. In practice, this looks like this:

# assign lm() output to some object `fit_california`
fit_california <- lm(formula = "testscr ~ str", data = Caschool)

# ask R for the regression summary
summary(fit_california) 
## 
## Call:
## lm(formula = "testscr ~ str", data = Caschool)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.727 -14.251   0.483  12.822  48.540 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 698.9330     9.4675  73.825  < 2e-16 ***
## str          -2.2798     0.4798  -4.751 2.78e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.58 on 418 degrees of freedom
## Multiple R-squared:  0.05124,    Adjusted R-squared:  0.04897 
## F-statistic: 22.58 on 1 and 418 DF,  p-value: 2.783e-06

Again, we recognize our intercept and slope estimates from before, alongside some other numbers and indications. This output is called a regression table, and you will be able to decipher it by the end of this course.
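Beyond printing the regression table, the fitted object lets you extract individual pieces programmatically with base R accessors, for example:

coef(fit_california)            # named vector of coefficient estimates
confint(fit_california)         # 95% confidence intervals for the coefficients
head(residuals(fit_california)) # first few regression residuals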

Plotting the regression line

We can also use our lm fit to draw the regression line on top of our initial scatterplot, using the following syntax:

plot(formula = testscr ~ str,
     data = Caschool,
     xlab = "Student/Teacher Ratio",
     ylab = "Average Test Score", pch = 21, col = 'blue') # same plot as before

abline(fit_california, col = 'red') # add regression line

As you probably expected, there is a slight negative relationship between a school’s Student/Teacher Ratio and its average test results.
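The fitted model can also be used to generate predictions. For instance, predict() returns the model’s predicted average test score for any hypothetical Student/Teacher Ratio (the value 20 below is just an illustration):

# Predicted average test score for a school with 20 students per teacher
predict(fit_california, newdata = data.frame(str = 20))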

Multivariate Regression

With these tools in place, it is fairly straightforward to estimate models with more than one explanatory variable, for example adding average family income to our previous model:

\[(testscr)_i = \beta_0 + \beta_1 (str)_i + \beta_2 (avginc)_i + \epsilon_i\] We can incorporate this new variable into our model by simply adding it to our formula:

fit_multivariate <- lm(formula = "testscr ~ str + avginc", data = Caschool)
summary(fit_multivariate)
## 
## Call:
## lm(formula = "testscr ~ str + avginc", data = Caschool)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.608  -9.052   0.707   9.259  31.898 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 638.72915    7.44908  85.746   <2e-16 ***
## str          -0.64874    0.35440  -1.831   0.0679 .  
## avginc        1.83911    0.09279  19.821   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.35 on 417 degrees of freedom
## Multiple R-squared:  0.5115, Adjusted R-squared:  0.5091 
## F-statistic: 218.3 on 2 and 417 DF,  p-value: < 2.2e-16

Although it is quite cumbersome and not typical to visualize multivariate regressions, we can still do so with two explanatory variables, using a (two-dimensional) regression plane.
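If you would like to try this yourself, here is one possible sketch using base R’s persp(); the grid resolution and viewing angles below are arbitrary choices:

# Grids of values covering the range of each explanatory variable
str_grid    <- seq(min(Caschool$str), max(Caschool$str), length.out = 30)
avginc_grid <- seq(min(Caschool$avginc), max(Caschool$avginc), length.out = 30)

# Predicted test score at every combination of the two grids
z <- outer(str_grid, avginc_grid,
           function(s, a) predict(fit_multivariate,
                                  newdata = data.frame(str = s, avginc = a)))

# Draw the regression plane as a 3D surface
persp(str_grid, avginc_grid, z,
      xlab = "Student/Teacher Ratio", ylab = "Average Income",
      zlab = "Predicted Test Score", theta = 30, phi = 20)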

While you explore this plot, ask yourself the following question: if you could only choose one of the two explanatory variables in our model (that is, either \(str\) or \(avginc\)) to predict the value of a given school’s average test score, which one would you choose? Why? Discuss this with your classmates.